Safe Reinforcement Learning
ProSh: Probabilistic Shielding for Model-free Reinforcement Learning
Hamel-De le Court, Edwin, Ohlmann, Gaspard, Belardinelli, Francesco
Safety is a major concern in reinforcement learning (RL): we aim to develop RL systems that not only perform optimally but are also safe to deploy, by providing formal guarantees about their safety. To this end, we introduce Probabilistic Shielding via Risk Augmentation (ProSh), a model-free algorithm for safe reinforcement learning under cost constraints. ProSh augments the Constrained MDP state space with a risk budget and enforces safety by applying a shield to the agent's policy distribution using a learned cost critic. The shield ensures that all sampled actions remain safe in expectation. We also show that optimality is preserved when the environment is deterministic. Since ProSh is model-free, safety during training depends on the knowledge acquired about the environment. We provide a tight upper bound on the expected cost, depending only on the backup-critic accuracy, that is always satisfied during training. Under mild, practically achievable assumptions, ProSh guarantees safety even at training time, as shown in the experiments.
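The shielding step described above can be sketched concretely: given a learned cost critic and a remaining risk budget, mask out actions whose expected cost exceeds the budget and renormalize the policy. The function and variable names below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def shield(policy_probs, cost_q, budget):
    """Mask actions whose estimated expected cost exceeds the remaining
    risk budget, then renormalize the policy distribution (a sketch of
    a ProSh-style shield; names and shapes are assumptions)."""
    safe = cost_q <= budget             # actions deemed safe in expectation
    if not safe.any():                  # fallback: keep the least-costly action
        safe = cost_q == cost_q.min()
    masked = np.where(safe, policy_probs, 0.0)
    return masked / masked.sum()

# Toy example: 4 actions, remaining budget of 0.5
probs = np.array([0.4, 0.3, 0.2, 0.1])
qc = np.array([0.2, 0.9, 0.4, 0.6])     # learned cost-critic estimates
print(shield(probs, qc, budget=0.5))    # mass moves to actions 0 and 2
```

Because only the action distribution is reshaped, the agent's underlying policy network is untouched; the shield acts purely at sampling time.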
Learning to Undo: Rollback-Augmented Reinforcement Learning with Reversibility Signals
Sorstkins, Andrejs, Tariq, Omer, Bilal, Muhammad
This paper proposes a reversible learning framework to improve the robustness and efficiency of value-based Reinforcement Learning agents, addressing vulnerability to value overestimation and instability in partially irreversible environments. The framework has two complementary core mechanisms: an empirically derived transition-reversibility measure Φ(s, a) and a selective state-rollback operation. We introduce an online per-state-action estimator Φ that quantifies the likelihood of returning to a prior state within a fixed horizon K. This measure is used to dynamically adjust the penalty term during temporal-difference updates, integrating reversibility awareness directly into the value function. The system also includes a selective rollback operator: when an action yields an expected return markedly lower than its instantaneous estimated value and violates a predefined threshold, the agent is penalized and returned to the preceding state rather than progressing. This interrupts suboptimal, high-risk trajectories and avoids catastrophic steps. By combining reversibility-aware evaluation with targeted rollback, the method improves safety, performance, and stability. In the CliffWalking-v0 domain, the framework reduced catastrophic falls by over 99.8% and yielded a 55% increase in mean episode return. In the Taxi-v3 domain, it suppressed illegal actions by ≥ 99.9% and achieved a 65.7% improvement in cumulative reward, while also sharply reducing reward variance in both environments. Ablation studies confirm that the rollback mechanism is the critical component underlying these safety and performance gains, marking a robust step toward safe and reliable sequential decision-making.
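The two mechanisms above can be combined in a single tabular Q-learning step: scale the penalty by estimated irreversibility (1 − Φ), and signal a rollback when the penalized TD error falls below a threshold. The penalty form, threshold, and names here are illustrative assumptions, not the paper's exact formulation.

```python
def td_update_with_rollback(Q, s, a, r, s_next, actions, phi,
                            alpha=0.5, gamma=0.99,
                            base_penalty=1.0, rollback_threshold=-5.0):
    """One tabular Q-learning update with a reversibility-scaled penalty
    and a rollback signal (a sketch under assumed hyperparameters)."""
    penalty = base_penalty * (1.0 - phi)       # irreversible moves cost more
    target = r - penalty + gamma * max(Q.get((s_next, b), 0.0) for b in actions)
    delta = target - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * delta
    return delta < rollback_threshold          # True => roll the agent back to s
```

The caller would restore the previous environment state whenever the function returns True, interrupting the high-risk trajectory instead of letting it continue.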
Safe Planning and Policy Optimization via World Model Learning
Latyshev, Artem, Gorbov, Gregory, Panov, Aleksandr I.
Reinforcement Learning (RL) applications in real-world scenarios must prioritize safety and reliability, which impose strict constraints on agent behavior. Model-based RL leverages predictive world models for action planning and policy optimization, but inherent model inaccuracies can lead to catastrophic failures in safety-critical settings. We propose a novel model-based RL framework that jointly optimizes task performance and safety. To address world model errors, our method incorporates an adaptive mechanism that dynamically switches between model-based planning and direct policy execution. We resolve the objective mismatch problem of traditional model-based approaches using an implicit world model. Furthermore, our framework employs dynamic safety thresholds that adapt to the agent's evolving capabilities, consistently selecting actions that surpass safe policy suggestions in both performance and safety. Experiments demonstrate significant improvements over non-adaptive methods, showing that our approach optimizes safety and performance simultaneously rather than merely meeting minimum safety requirements. The proposed framework achieves robust performance on diverse safety-critical continuous control tasks, outperforming existing methods.
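The adaptive switching idea above can be sketched as a running estimate of world-model prediction error that triggers a fallback from planning to direct policy execution. The error metric, momentum, and threshold below are assumptions for illustration, not the paper's exact criterion.

```python
class AdaptiveSwitch:
    """Fall back from model-based planning to direct policy execution
    when the world model's running prediction error grows too large
    (a minimal sketch of an adaptive switching mechanism)."""

    def __init__(self, threshold=0.1, momentum=0.9):
        self.err = 0.0                  # exponentially averaged model error
        self.threshold = threshold
        self.momentum = momentum

    def update(self, predicted_state, true_state):
        # Squared prediction error of the world model on the last step
        e = sum((p - t) ** 2 for p, t in zip(predicted_state, true_state))
        self.err = self.momentum * self.err + (1 - self.momentum) * e
        return self.err

    def act(self, planned_action, policy_action):
        return policy_action if self.err > self.threshold else planned_action
```

Keeping the error estimate exponentially averaged means a single bad prediction does not immediately disable planning, while a persistent mismatch does.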
From Text to Trajectory: Exploring Complex Constraint Representation and Decomposition in Safe Reinforcement Learning
Safe reinforcement learning (RL) requires the agent to complete a given task while obeying specific constraints. Giving constraints in natural-language form has great potential for practical scenarios due to its flexibility, transferability, and accessibility. Previous safe RL methods with natural-language constraints typically need to manually design a cost function for each constraint, which requires domain expertise and lacks flexibility. In this paper, we harness the dual role of text in this task, using it not only to specify constraints but also as a training signal. We introduce the Trajectory-level Textual Constraints Translator (TTCT) to replace the manually designed cost function.
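One way to picture replacing a hand-designed cost function with a text-derived signal is to score each trajectory by its embedding alignment with the constraint text. This is a heavily simplified sketch: the embedding models and TTCT's actual training objective are not shown, and every name here is an assumption.

```python
import numpy as np

def text_cost(traj_embedding, constraint_embedding):
    """Cosine alignment between a trajectory embedding and a
    constraint-text embedding, clipped at zero, used as a cost
    (a hypothetical stand-in for a learned text-to-cost module)."""
    a = np.asarray(traj_embedding, dtype=float)
    b = np.asarray(constraint_embedding, dtype=float)
    a /= np.linalg.norm(a)
    b /= np.linalg.norm(b)
    # Higher alignment with the forbidden behaviour => higher cost
    return max(0.0, float(a @ b))
```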
Enhancing Efficiency of Safe Reinforcement Learning via Sample Manipulation
Safe reinforcement learning (RL) is crucial for deploying RL agents in real-world applications, as it aims to maximize long-term rewards while satisfying safety constraints. However, safe RL often suffers from sample inefficiency, requiring extensive interactions with the environment to learn a safe policy. We propose Efficient Safe Policy Optimization (ESPO), a novel approach that enhances the efficiency of safe RL through sample manipulation. ESPO employs an optimization framework with three modes: maximizing rewards, minimizing costs, and balancing the trade-off between the two. By dynamically adjusting the sampling process based on the observed conflict between reward and safety gradients, ESPO theoretically guarantees convergence, optimization stability, and improved sample complexity bounds.
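The three-mode switching described above can be sketched with a gradient-conflict test: when the cost constraint is violated, prioritize safety; otherwise check whether the reward and cost gradients point the same way. The dot-product test and hard switch on violation are simplifying assumptions, not ESPO's exact rule.

```python
import numpy as np

def choose_mode(reward_grad, cost_grad, cost_violation):
    """Pick one of three update modes from the observed conflict between
    reward and cost gradients (an illustrative sketch of ESPO-style
    mode switching)."""
    if cost_violation > 0:
        return "minimize_cost"            # infeasible: restore safety first
    if np.dot(reward_grad, cost_grad) > 0:
        return "balance"                  # reward ascent would raise cost
    return "maximize_reward"              # no conflict detected
```

In the sampling-manipulation view, the chosen mode would also determine how many samples the next update draws and which objective they serve.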
Reviews: Constrained Cross-Entropy Method for Safe Reinforcement Learning
This paper studies constrained optimal control, where the goal is to produce a policy that maximizes an objective function subject to a constraint. The authors provide great motivation for this setting, explaining why the constraint cannot simply be included as a large negative reward. They detail challenges in solving this problem, especially if the initial policy does not satisfy the constraint. They also note a clever extension of their method, where they use the constraint to define the objective, by setting the constraint to indicate whether the task is solved. Their algorithm builds upon CEM: at each iteration, if there are no feasible policies, they maximize the constraint function for the policies with the largest objective; otherwise, they maximize the objective function for feasible policies.
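The elite-selection rule the review describes can be sketched in one dimension: if any sampled candidates are feasible, the elites are the feasible ones with the highest objective; otherwise elites are chosen to improve the constraint value. This is an illustrative scalar version, not the paper's implementation.

```python
import numpy as np

def constrained_cem_step(mean, std, objective, constraint, limit,
                         n_samples=200, n_elite=20, seed=0):
    """One iteration of a constrained cross-entropy method step
    (1-D sketch: `objective` is maximized, feasibility is c(x) <= limit)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mean, std, size=n_samples)
    c = constraint(x)
    feasible = c <= limit
    if feasible.any():
        cand = x[feasible]
        elite = cand[np.argsort(objective(cand))[-n_elite:]]
    else:
        elite = x[np.argsort(c)[:n_elite]]   # move toward feasibility first
    return elite.mean(), elite.std()
```

Iterating this step shifts the sampling distribution toward high-objective feasible candidates, mirroring the two-phase behaviour the review highlights.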